go/mysql: performance optimizations in protocol encoding #16341

mattrobenolt · 2024-07-04T22:51:11Z

This employs a couple of tricks that, combined, seemed fruitful:

* Swapping to binary.LittleEndian.Put* on the basic calls gets us a free
  boost while removing code. The main win from this swap is in the slice
  bounds checking, resulting in a massive boost. In `writeLenEncInt` I
  kept the writes inlined but added my own bounds checking, since
  swapping them out there resulted in a very minor performance
  regression from the current results. I assume that comes from the
  extra coercion needed to the uint* type, and another reslice.
* Reslicing the byte slice early so all future operations work on
  0-index rather than pos+ indexing. This seemed to be a pretty sizeable
  win: later operations no longer need an extra addition to determine
  the index, because the indexes get swapped out for constants.
* Read path employs the same early reslicing, but already has explicit
  bounds checks.
* Rewrite `writeZeroes` to utilize the Go memclr optimization.
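
As a rough illustration of the first three points, here is a minimal sketch. It is not the actual Vitess code: every function name below is hypothetical, and the length-encoded example covers only the 0xfc-prefixed (16-bit) case of the MySQL length-encoded integer format.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// writeUint32Indexed is a hypothetical "before" shape: every store
// computes pos+k and is bounds-checked on its own.
func writeUint32Indexed(data []byte, pos int, v uint32) int {
	data[pos] = byte(v)
	data[pos+1] = byte(v >> 8)
	data[pos+2] = byte(v >> 16)
	data[pos+3] = byte(v >> 24)
	return pos + 4
}

// writeUint32Resliced is a hypothetical "after" shape: reslice once so
// the write starts at index 0, then let binary.LittleEndian.PutUint32
// do the bounds check and the little-endian store.
func writeUint32Resliced(data []byte, pos int, v uint32) int {
	binary.LittleEndian.PutUint32(data[pos:], v)
	return pos + 4
}

// writeLenEncUint16 sketches the inlined variant with a manual bounds
// check: reslice early, touch the highest index once, then store with
// constant indexes. Only the 0xfc-prefixed case is shown.
func writeLenEncUint16(data []byte, pos int, v uint16) int {
	data = data[pos:]
	_ = data[2] // single up-front bounds check for the three stores below
	data[0] = 0xfc
	data[1] = byte(v)
	data[2] = byte(v >> 8)
	return pos + 3
}

// readUint64 sketches a read path with an explicit bounds check
// followed by an early reslice.
func readUint64(data []byte, pos int) (uint64, int, bool) {
	if pos+8 > len(data) {
		return 0, 0, false
	}
	return binary.LittleEndian.Uint64(data[pos:]), pos + 8, true
}

func main() {
	a := make([]byte, 8)
	b := make([]byte, 8)
	writeUint32Indexed(a, 2, 0x01020304)
	writeUint32Resliced(b, 2, 0x01020304)
	fmt.Println(a, b) // both buffers hold the same little-endian bytes

	c := make([]byte, 4)
	fmt.Println(writeLenEncUint16(c, 1, 0x1234), c)

	v, next, ok := readUint64(make([]byte, 8), 0)
	fmt.Println(v, next, ok)
}
```

Both 32-bit writers produce identical bytes; the difference is only in how much index arithmetic and bounds checking the compiler has to emit, which is what the benchmarks below measure.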

$ benchstat {old,new}.txt
goos: darwin
goarch: arm64
pkg: vitess.io/vitess/go/mysql
                                 │    old.txt     │               new.txt                │
                                 │     sec/op     │    sec/op     vs base                │
EncWriteInt/16-bit-10               0.4685n ±  0%   0.3516n ± 0%  -24.94% (p=0.000 n=10)
EncWriteInt/16-bit-lenencoded-10     2.049n ±  0%    2.049n ± 0%        ~ (p=0.972 n=10)
EncWriteInt/24-bit-lenencoded-10     1.987n ±  0%    2.056n ± 0%   +3.45% (p=0.000 n=10)
EncWriteInt/32-bit-10               0.7819n ±  0%   0.3906n ± 0%  -50.05% (p=0.000 n=10)
EncWriteInt/64-bit-10               1.4080n ±  0%   0.4684n ± 0%  -66.73% (p=0.000 n=10)
EncWriteInt/64-bit-lenencoded-10     3.126n ±  0%    2.051n ± 0%  -34.40% (p=0.000 n=10)
EncWriteZeroes/4-bytes-10           2.5030n ±  0%   0.3123n ± 0%  -87.52% (p=0.000 n=10)
EncWriteZeroes/10-bytes-10          4.3815n ±  0%   0.3120n ± 0%  -92.88% (p=0.000 n=10)
EncWriteZeroes/23-bytes-10          8.4575n ±  0%   0.3124n ± 0%  -96.31% (p=0.000 n=10)
EncWriteZeroes/55-bytes-10         20.8750n ± 10%   0.6245n ± 0%  -97.01%
EncReadInt/16-bit-10                 2.050n ±  0%    2.068n ± 1%   +0.90% (p=0.001 n=10)
EncReadInt/24-bit-10                 2.034n ±  0%    2.050n ± 0%   +0.76% (p=0.000 n=10)
EncReadInt/64-bit-10                 2.819n ±  1%    2.187n ± 0%  -22.41% (p=0.000 n=10)
geomean                              2.500n         0.8363n       -66.55%

Related issue:

#16789

Checklist

"Backport to:" labels have been added if this change should be back-ported to release branches
If this change is to be back-ported to previous releases, a justification is included in the PR description
Tests were added or are not required
Did the new or modified tests pass consistently locally and on CI?
Documentation was added or is not required

vitess-bot · 2024-07-04T22:51:14Z

codecov · 2024-07-04T23:10:36Z

Codecov Report

Attention: Patch coverage is 98.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 68.71%. Comparing base (cb2d0df) to head (2a6a739).
Report is 1 commit behind head on main.

Files	Patch %	Lines
go/mysql/encoding.go	98.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #16341      +/-   ##
==========================================
- Coverage   68.72%   68.71%   -0.02%     
==========================================
  Files        1547     1547              
  Lines      198267   198317      +50     
==========================================
+ Hits       136264   136271       +7     
- Misses      62003    62046      +43

go/mysql/encoding.go

dbussink · 2024-07-05T06:01:51Z

You'll need to update the commits and force push to ensure the DCO sign off.

dbussink · 2024-07-05T09:22:12Z

@mattrobenolt Is clearing this way faster than using the following?

func writeZeroes(data []byte, pos int, len int) int {
	data = data[pos : pos+len]
	for i := range data {
		data[i] = 0
	}
	return pos + len
}

The Go compiler recognizes that pattern and optimizes it into a CALL runtime.memclrNoHeapPointers(SB) essentially. Is that just as fast or even faster? See also golang/go#5373 for that pattern.
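
A minimal sketch of the idiom being discussed, next to the `clear` builtin (Go 1.21+), which is another way to express the same zeroing. The function names here are hypothetical and this is not the Vitess implementation; the parameter is named `n` rather than `len` only to avoid shadowing the builtin.

```go
package main

import "fmt"

// zeroRange uses the range-loop idiom: reslice to exactly the region to
// clear, then assign zero to every element by index.
func zeroRange(data []byte, pos, n int) int {
	end := pos + n
	buf := data[pos:end]
	for i := range buf {
		buf[i] = 0
	}
	return end
}

// zeroClear uses the clear builtin, which sets every element of the
// slice to its zero value.
func zeroClear(data []byte, pos, n int) int {
	end := pos + n
	clear(data[pos:end])
	return end
}

func main() {
	a := []byte{1, 2, 3, 4, 5, 6}
	b := []byte{1, 2, 3, 4, 5, 6}
	fmt.Println(zeroRange(a, 1, 3), a)
	fmt.Println(zeroClear(b, 1, 3), b)
}
```

Both helpers leave the bytes outside `[pos, pos+n)` untouched and panic if that range is out of bounds for the slice, because the reslice itself is bounds-checked.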

mattrobenolt · 2024-07-05T20:01:59Z

@dbussink lol so

x_test.go source

package main

import "testing"

func writeZeroesVitessMain(data []byte, pos, len int) int {
	for i := 0; i < len; i++ {
		data[pos+i] = 0
	}
	return pos + len
}

func writeZeroesSpecialized23(data []byte, pos int) int {
	data = data[pos:]

	_ = data[22]
	data[0] = 0
	data[1] = 0
	data[2] = 0
	data[3] = 0
	data[4] = 0
	data[5] = 0
	data[6] = 0
	data[7] = 0
	data[8] = 0
	data[9] = 0
	data[10] = 0
	data[11] = 0
	data[12] = 0
	data[13] = 0
	data[14] = 0
	data[15] = 0
	data[16] = 0
	data[17] = 0
	data[18] = 0
	data[19] = 0
	data[20] = 0
	data[21] = 0
	data[22] = 0

	return pos + 23
}

func writeZeroesSpecialized10(data []byte, pos int) int {
	data = data[pos:]

	_ = data[9]
	data[0] = 0
	data[1] = 0
	data[2] = 0
	data[3] = 0
	data[4] = 0
	data[5] = 0
	data[6] = 0
	data[7] = 0
	data[8] = 0
	data[9] = 0

	return pos + 10
}

func writeZeroesMemclr(data []byte, pos, len int) int {
	end := pos + len
	data = data[pos:end]

	for i := range data {
		data[i] = 0
	}

	return end
}

func BenchmarkZeroes(b *testing.B) {
	buf := make([]byte, 128)

	b.Run("vitess-main/23-byte", func(b *testing.B) {
		for range b.N {
			_ = writeZeroesVitessMain(buf, 16, 23)
		}
	})

	b.Run("vitess-main/10-byte", func(b *testing.B) {
		for range b.N {
			_ = writeZeroesVitessMain(buf, 16, 10)
		}
	})

	b.Run("specialized/23-byte", func(b *testing.B) {
		for range b.N {
			_ = writeZeroesSpecialized23(buf, 16)
		}
	})

	b.Run("specialized/10-byte", func(b *testing.B) {
		for range b.N {
			_ = writeZeroesSpecialized10(buf, 16)
		}
	})

	b.Run("memclr/23-byte", func(b *testing.B) {
		for range b.N {
			_ = writeZeroesMemclr(buf, 16, 23)
		}
	})

	b.Run("memclr/10-byte", func(b *testing.B) {
		for range b.N {
			_ = writeZeroesMemclr(buf, 16, 10)
		}
	})
}

I put them all side by side and this is wild.

$ go test -v . -bench=.
goos: darwin
goarch: arm64
pkg: x
BenchmarkZeroes
BenchmarkZeroes/vitess-main/23-byte
BenchmarkZeroes/vitess-main/23-byte-10          138833698                8.634 ns/op
BenchmarkZeroes/vitess-main/10-byte
BenchmarkZeroes/vitess-main/10-byte-10          268413115                4.471 ns/op
BenchmarkZeroes/specialized/23-byte
BenchmarkZeroes/specialized/23-byte-10          570586390                2.125 ns/op
BenchmarkZeroes/specialized/10-byte
BenchmarkZeroes/specialized/10-byte-10          1000000000               0.6531 ns/op
BenchmarkZeroes/memclr/23-byte
BenchmarkZeroes/memclr/23-byte-10               1000000000               0.3263 ns/op
BenchmarkZeroes/memclr/10-byte
BenchmarkZeroes/memclr/10-byte-10               1000000000               0.3286 ns/op
PASS
ok      x       7.020s

That memclr optimization is really good.

Going to swap that out, which will let me get rid of the specialized versions. I didn't like that anyways.

mattrobenolt · 2024-07-05T20:13:01Z

@dbussink updated to the memclr optimization as well as benchmarks in the PR description.

deepthi · 2024-07-08T18:44:19Z

Nice work!

mattrobenolt requested review from harshit-gangal, systay and mattlord as code owners July 4, 2024 22:51

github-actions bot added this to the v21.0.0 milestone Jul 4, 2024

mattrobenolt force-pushed the speedup-mysql-encoding branch 6 times, most recently from c23d96b to c549579 July 5, 2024 00:56

mattrobenolt force-pushed the speedup-mysql-encoding branch from c549579 to 2a6a739 July 5, 2024 20:10

dbussink approved these changes Jul 6, 2024

arthurschreiber approved these changes Jul 6, 2024

systay approved these changes Jul 8, 2024

systay merged commit d9475d8 into vitessio:main Jul 8, 2024
99 of 100 checks passed

mattrobenolt deleted the speedup-mysql-encoding branch July 8, 2024 18:00

mattrobenolt mentioned this pull request Jul 8, 2024

go/mysql: use clear builtin for zerofill #16348

Merged
